1 Introduction

1.1 Background

This article aims to meet the second LBB (Learn by Building) assignment from Algoritma, i.e. Data Visualization course. This article is based on P4DS LBB question22222222question020/00//0article. However, there are some improvements which are as follows:

  • The usage of dplyr package is emphasized.
  • More research questions are provided.
  • More figures are provided.
  • ggplot2 and plotly packages are utilized to plot the data instead of base plot.

In this article, you could expect to find the following sections:

  • Introduction of the article can be found in here Introduction
  • More detailed explanation of the data is in Dataset
  • How to prepare the data can be read in Preparation
  • Steps to clean and pre-process the data is in Data Pre-processing
  • Discussion about solving the pre-defined problems is described in Analysis
  • Conclusions of the article can be read in Conclusions

1.2 Objectives

By arranging this article, I’m going to tackle with these problems as follows:

  1. What is the job title paid the most in each area in every given year?
  2. Which top 5 sectors do pay salary the most in Boston every year?
  3. Is it true that a company with more than 5000 employees in San Francisco pays its employees’ salary higher than a startup (a company with employees equal or less than 200) in NYC? Show the data in every year!
  4. What is the relationship between salary and job openings in the US?
  5. Compare salaries of each area from Metro Median Pay and US Median Pay from Quick Facts!
  6. Show distributions of company size’s salary and job openings!

2 Dataset

The data are compiled monthly by Glassdoor from 2016. Each monthly datum is in .xlsx extension file. The columns of each of it consist of:

  • Metro

Some metropolitan areas in the US. The National value indicates data from national area.

  • Dimension Type

Type of data we want, e.g. Quick Facts (facts about median salary and job openings), Industry (sectors of the jobs), and so on.

  • Month

The time format of the data.

  • Dimension

This is the derivative of Dimension Type column. All options here depend on Dimension Type.

  • Measure

This column determines what method of measurement used.

  • Value

Depending on Dimension Type and Dimension columns you chose, this Value column contains either median salary or number of job openings. The value with ‘$’ (dollar) sign indicates salary in USD.

  • YoY

Year over year in percent. For more information about how to understand and calculate YoY, you can read this article I found really helpful. However, in this article, I won’t highlight this YoY feature.

Below is a quick look of the data in spreadsheet application.

For more information regarding the data, you can read its methodology written exclusively by Glassdoor here.

3 Preparation

We have identified the data, so now we’re going to prepare the data. Although the total data started from 2016, in this article, I only use monthly data from 2018 and 2019; there are 24 .xlsx files. It will be time consuming and exhausting if we download each file one by one manually. Therefore, here we’re going to scrape the Glassdoor web page and download all necessary files automatically.

3.3 Load the Data into RStudio

In order to load each monthly xlsx file into RStudio, we define a function select_files_to_read().

And now call the select_files_to_read() then insert it into another function loadMultipleFiles(). The latter works to iterate select_files_to_read() function in specified number of years fully.

After we’ve defined both functions above, now we’re going to use them by setting their arguments in order to load all monthly files in 2018 and 2019. The outputs of below chunk are dataframes of each monthly file.

4 Data Pre-processing

First of all, since the file used is big enough, it is recommended to open it in any Spreadsheet application in advance to obtain and explore a quick insight regarding the data.

After viewing the file in Spreadsheet application more detail as shown in the picture in Introduction, firstly, we focus on Dimension Type column as it explains different information as follows:

  1. Company Size

This value has the following Dimensions:

  • <51 employees
  • 51-200 employees
  • 201-500 employees
  • 501-1000 employees
  • 1001-5000 employees
  • +5000 employees
  • Other: we should highlight this!
  • Empty value: we should highlight this!

This column also has the following Measures:

  • Job Openings (with format <numbers> only in Value column)
  • Median Base Pay (with format either <numbers>.0 or <numbers> in Value column)
  1. Industry

This value has a number of list of sectors. I’m not going to write it down here. And, this value has the following Measures:

  • Job Openings (with format <numbers> only in Value column)
  • Median Base Pay (with format either <numbers>.0 or <numbers> in Value column)
  1. Job Title

This value has a number of list of job titles. I’m not going to write it down here. And, this value has one Measure only, i.e. Median Base Pay (with format either $ <numbers>,<numbers> or $<numbers>,<numbers> in Value column).

  1. Quick Facts

This value’s Dimension and Measures are perfectly equal which are the followings:

  • Job Openings (with format <numbers>,<numbers> in Value column)
  • Labor Force Size (with format <numbers>,<numbers>,<numbers> in Value column)
  • Metro Job Openings (with format <numbers>,<numbers> in Value column)
  • Metro Median Pay (with format either $ <numbers>,<numbers> or $<numbers>,<numbers> in Value column)
  • Total Employment (with format <number>,<numbers>,<numbers> in Value column)
  • US Job Openings (with format <number>,<numbers>,<numbers> in Value column)
  • US Median Pay (with format $ <numbers>,<numbers> in Value column)
  • Unemployment Rate (with format <number>.<number>% in Value column)
  1. Timeseries

This value’s Dimension and Measures are perfectly equal which are the followings:

  • Job Openings (with format either <numbers>.<numbers> or <numbers> in Value column)
  • Median Base Pay (with format either <numbers>.<numbers> or <numbers> in Value column)

Particulary for Labor Force Size, Total Employment, and Unemployment Rate in Quick Facts, I don’t need them as their data are too limited. I will remove those later in the next section.

After we saw the data in Excel-like app already, we’re going to back to R to manipulate the data. First, let’s see the data structure.

## 'data.frame':    72455 obs. of  7 variables:
##  $ Metro         : chr  "National" "Atlanta" "Atlanta" "Atlanta" ...
##  $ Dimension Type: chr  "Quick Facts" "Quick Facts" "Quick Facts" "Quick Facts" ...
##  $ Month         : chr  "2018-04" "2018-04" "2018-04" "2018-04" ...
##  $ Dimension     : chr  "U.S. Median Pay" "Metro Median Pay" "U.S. Median Pay" "Job Openings" ...
##  $ Measure       : chr  "U.S. Median Pay" "Metro Median Pay" "U.S. Median Pay" "Job Openings" ...
##  $ Value         : chr  "$ 51,927" "$ 53,732" "$ 51,927" "118,496" ...
##  $ YoY           : chr  "1.2%" "1.9%" "1.2%" "16.5%" ...

Based on the information above, we need to change data type a number of columns. However, before we begin do that, for the sake of convenience, we need to fix two columns’ name: - “Dimension Type” to “Dimention_Type” - “Month” to “Date”

Then, referring to the data structure above, all columns’ data type are in character data type initially. By using our judgement, we’re sure the columns of Date, Value, and YoY should be in date, numeric, and numeric data type respectively. But, for the rest of columns, we have to make sure whether they should be left as they are or be changed to factor data type.

4.1 Drop NA Value

As a starter, let’s check whether NA exists.

##          Metro Dimension_Type           Date      Dimension        Measure 
##              0              0              0             88              0 
##          Value            YoY 
##            297          45671

We can see that NA exist in Dimension, Value, and YoY columns. Let’s see each of column with NA values.

From the dataframe above, it can be seen that the 88 NAs occur on Dimension_Type == "Company Size" and Measure == "Job Openings" only.

It can be seen from dataframe above that the 297 NA values in Value column only happen in Dimension_Type == "Timeseries", Dimension == "Job Openings, and Dimension == "Job Openings.

From dataframe above, it can be seen that there are lots of NAs in YoY column. However, since we’re not interested in YoY in this article, it’s not a problem if we completely drop this column. Then, we clear all NAs in the other two columns.

##          Metro Dimension_Type           Date      Dimension        Measure 
##              0              0              0              0              0 
##          Value 
##              0

In addition, as I pointed earlier, I won’t use Dimension == "Labor Force Size", Dimension == "Total Employment", and Dimension == "Unemployment Rate". So, let’s drop them down.

##          Metro Dimension_Type           Date      Dimension        Measure 
##              0              0              0              0              0 
##          Value 
##              0

Excellent! All data clean! Then, now let’s move on to set the data type of each column.

4.2 Change Data Type

  • Metro column
##  [1] "National"      "Atlanta"       "New York City" "Los Angeles"  
##  [5] "Philadelphia"  "Houston"       "Seattle"       "San Francisco"
##  [9] "Chicago"       "Boston"        "Washington DC" "U.S."         
## [13] "New-York-City" "Los-Angeles"   "San-Francisco" "Washington-DC"

For Metro column, there are 16 unique observations only. However, if you see them more closely, there are some areas with the same meaning but in slight different typing. Therefore, it will be more easier to handle if we tackle with this problem and change this column to factor data type.

##  [1] "Atlanta"       "Boston"        "Chicago"       "Houston"      
##  [5] "Los Angeles"   "National"      "New York City" "Philadelphia" 
##  [9] "San Francisco" "Seattle"       "Washington DC"

Great! We’ve solved the problem and changed the data type of Metro column.

  • Dimension_Type column
## [1] "Quick Facts"  "Job Title"    "Company Size" "Industry"     "Timeseries"

For Dimension_Type column, there are 5 unique observations only so that no wonder if we need to change its data type to factor immediately.

  • Dimension column
##   [1] "U.S. Median Pay"                  "Metro Median Pay"                
##   [3] "Job Openings"                     "Software Engineer"               
##   [5] "Programmer Analyst"               "Consultant"                      
##   [7] "Project Manager"                  "Store Manager"                   
##   [9] "Business Analyst"                 "Cashier"                         
##  [11] "Financial Analyst"                "Administrative Assistant"        
##  [13] "Accountant"                       "Sales Representative"            
##  [15] "Professor"                        "Registered Nurse"                
##  [17] "Marketing Manager"                "Operations Manager"              
##  [19] "Product Manager"                  "Sales Manager"                   
##  [21] "Teacher"                          "Bank Teller"                     
##  [23] "Customer Service Manager"         "Physical Therapist"              
##  [25] "Mechanical Engineer"              "Recruiter"                       
##  [27] "Design Engineer"                  "Pharmacy Technician"             
##  [29] "Graphic Designer"                 "Web Developer"                   
##  [31] "Business Development Manager"     "Data Scientist"                  
##  [33] "Buyer"                            "Data Analyst"                    
##  [35] "Electrical Engineer"              "Attorney"                        
##  [37] "Human Resources Manager"          "Technician"                      
##  [39] "Restaurant Cook"                  "Production Manager"              
##  [41] "Manufacturing Engineer"           "Retail Key Holder"               
##  [43] "Paralegal"                        "Java Developer"                  
##  [45] "Warehouse Associate"              "Property Manager"                
##  [47] "Producer"                         "Quality Engineer"                
##  [49] "Operations Analyst"               "Medical Assistant"               
##  [51] "Tax Manager"                      "Technical Editor"                
##  [53] "Truck Driver"                     "Insurance Agent"                 
##  [55] "Medical Technologist"             "Field Engineer"                  
##  [57] "Machine Operator"                 "Communications Manager"          
##  [59] "Construction Laborer"             "Emergency Medical Technician"    
##  [61] "Web Designer"                     "Maintenance Worker"              
##  [63] "Barista"                          "Bartender"                       
##  [65] "Civil Engineer"                   "Claims Adjuster"                 
##  [67] "Certified Nursing Assistant"      "Customer Service Representative" 
##  [69] "Delivery Driver"                  "Financial Advisor"               
##  [71] "Licensed Practical Nurse"         "Loan Officer"                    
##  [73] "Maintenance Technician"           "Material Handler"                
##  [75] "Network Engineer"                 "Office Manager"                  
##  [77] "Pharmacist"                       "Process Engineer"                
##  [79] "Program Manager"                  "Research Assistant"              
##  [81] "Restaurant Manager"               "Security Officer"                
##  [83] "Server"                           "Solutions Architect"             
##  [85] "Systems Engineer"                 "Technical Support"               
##  [87] "UX Designer"                      "1001-5000"                       
##  [89] "<51"                              "201-500"                         
##  [91] "5000+"                            "501-1000"                        
##  [93] "51-200"                           "Accounting & Legal"              
##  [95] "Aerospace & Defense"              "Architecture & Civil Engineering"
##  [97] "Arts & Entertainment"             "Automotive"                      
##  [99] "Banking & Financial Services"     "Beauty & Fitness"                
## [101] "Biotech & Pharmaceuticals"        "Business Services"               
## [103] "Computer Software & Hardware"     "Construction"                    
## [105] "Consulting"                       "Consumer Electronics"            
## [107] "Consumer Services"                "Education & Schools"             
## [109] "Energy & Utilities"               "Facilities"                      
## [111] "Farming & Agriculture"            "Government"                      
## [113] "Health Care & Hospitals"          "Information Technology"          
## [115] "Insurance"                        "Internet & Tech"                 
## [117] "Manufacturing"                    "Marketing & Advertising"         
## [119] "Media & Publishing"               "Nonprofit"                       
## [121] "Private Security"                 "Real Estate"                     
## [123] "Recruiting & Staffing"            "Restaurants & Bars"              
## [125] "Retail"                           "Supermarkets"                    
## [127] "Telecommunications"               "Transportation & Logistics"      
## [129] "Travel & Tourism"                 "Wholesale"                       
## [131] "Metro Median Base Pay"            "IT Manager"                      
## [133] "U.S. Job Openings"                "Metro Job Openings"              
## [135] "Other"
## [1] 135

For Dimension column, there 135 unique observations. That’s a big number! So, we just leave its data type as it is.

  • Measure column
## [1] "U.S. Median Pay"    "Metro Median Pay"   "Job Openings"      
## [4] "Median Base Pay"    "U.S. Job Openings"  "Metro Job Openings"

For Measure column, there are 6 unique observations only. So, let’s change its data type to factor.

  • Date column

While we’re sure Date column should be in date data type, we still need to arrange its data type. Before we begin, let’s take a quick look at the head and tail of such column.

## [1] "2018-04" "2018-04" "2018-04" "2018-04" "2018-04" "2018-04"
## [1] "2019-09" "2019-09" "2019-09" "2019-09" "2019-09" "2019-09"

We saw that the format of the column is in year-month, without date. However, since the total number of rows of the data is 50015, we are not sure yet whether all observations of such column are written in the same format. What if there are some row written in year-month-date or even other format?

To answer this, let’s see the column deeper by using unique().

Bingo! As I told you, there are some with year-month-date format. Therefore, we’ve got two formats in one column. And since this is annoying, we have to unify all observations to one format only. So, we have two format options: year-month or year-month-date.

If we were to choose the first option, we will lose other observations’ date information so that this option is not affordable. Then, if we were to choose another option, the consequence is that we have to coerce the year-month format oservations to year-month-date format. Simply, we could insert ‘dummy’ date, i.e. 01, at the end of each year-month observation. Therefore, I think it will be more comfortable if we select the second choice.

  • Value column

We stated earlier that Value should be in numeric data type. Furthermore, it was mentioned in the beginning of Data Cleansing that there is a number of Value column formats (after removing some rows). They are as follows:

  • <numbers>
  • <numbers>.0
  • <numbers>,<numbers>
  • $ <numbers>,<numbers>
  • $<numbers>,<numbers>
  • <number>,<numbers>,<numbers>

Before we jump to data type change, we have to handle those formats to make them in uniform format. We can use the following codes.

Well done! We have adjusted all columns’ format. So, now let’s check it the data out to convince ourself that we did well.

Alright. All columns have been in appropriate data type. For the next step, let’s move on to the next section below.

4.3 Data Transformation

As our data is too complicated to infer directly from one table, it will be more convenient if we split it into several subtables: salary for each job title, salary for each sector, salary for each company size, job openings for each sector, job openings for each company size, quick facts, and time series.

4.3.1 Create a Table “Salary for each job title”

This subtable consists of median salary for each job title in a given specific metro area and on a specific date. To create it, we extract some columns from our main data. We’re only interested in Dimension_Type of "Job Title".

Great! We’ve got what we desired. As we have new table, it’s gonna be better if we assign Job Title column data type to factor because job titles are just an iteration for each area and date.

##  [1] "Accountant"                      "Administrative Assistant"       
##  [3] "Attorney"                        "Bank Teller"                    
##  [5] "Barista"                         "Bartender"                      
##  [7] "Business Analyst"                "Business Development Manager"   
##  [9] "Buyer"                           "Cashier"                        
## [11] "Certified Nursing Assistant"     "Civil Engineer"                 
## [13] "Claims Adjuster"                 "Communications Manager"         
## [15] "Construction Laborer"            "Consultant"                     
## [17] "Customer Service Manager"        "Customer Service Representative"
## [19] "Data Analyst"                    "Data Scientist"                 
## [21] "Delivery Driver"                 "Design Engineer"                
## [23] "Electrical Engineer"             "Emergency Medical Technician"   
## [25] "Field Engineer"                  "Financial Advisor"              
## [27] "Financial Analyst"               "Graphic Designer"               
## [29] "Human Resources Manager"         "Insurance Agent"                
## [31] "IT Manager"                      "Java Developer"                 
## [33] "Licensed Practical Nurse"        "Loan Officer"                   
## [35] "Machine Operator"                "Maintenance Technician"         
## [37] "Maintenance Worker"              "Manufacturing Engineer"         
## [39] "Marketing Manager"               "Material Handler"               
## [41] "Mechanical Engineer"             "Medical Assistant"              
## [43] "Medical Technologist"            "Network Engineer"               
## [45] "Office Manager"                  "Operations Analyst"             
## [47] "Operations Manager"              "Paralegal"                      
## [49] "Pharmacist"                      "Pharmacy Technician"            
## [51] "Physical Therapist"              "Process Engineer"               
## [53] "Producer"                        "Product Manager"                
## [55] "Production Manager"              "Professor"                      
## [57] "Program Manager"                 "Programmer Analyst"             
## [59] "Project Manager"                 "Property Manager"               
## [61] "Quality Engineer"                "Recruiter"                      
## [63] "Registered Nurse"                "Research Assistant"             
## [65] "Restaurant Cook"                 "Restaurant Manager"             
## [67] "Retail Key Holder"               "Sales Manager"                  
## [69] "Sales Representative"            "Security Officer"               
## [71] "Server"                          "Software Engineer"              
## [73] "Solutions Architect"             "Store Manager"                  
## [75] "Systems Engineer"                "Tax Manager"                    
## [77] "Teacher"                         "Technical Editor"               
## [79] "Technical Support"               "Technician"                     
## [81] "Truck Driver"                    "UX Designer"                    
## [83] "Warehouse Associate"             "Web Designer"                   
## [85] "Web Developer"

4.3.2 Create a Table “Salary for each sector”

This subtable consists of median salary for each sector in a given specific metro area and on a specific date. To create it, we extract some columns from our main data. We’re only interested in Dimension_Type of "Industry" and Measure of "Median Base Pay".

As we have new table, it’s gonna be better if we assign Sector column data type to factor because sectors are just an iteration for each area and date.

##  [1] "Accounting & Legal"               "Aerospace & Defense"             
##  [3] "Architecture & Civil Engineering" "Arts & Entertainment"            
##  [5] "Automotive"                       "Banking & Financial Services"    
##  [7] "Beauty & Fitness"                 "Biotech & Pharmaceuticals"       
##  [9] "Business Services"                "Computer Software & Hardware"    
## [11] "Construction"                     "Consulting"                      
## [13] "Consumer Electronics"             "Consumer Services"               
## [15] "Education & Schools"              "Energy & Utilities"              
## [17] "Facilities"                       "Farming & Agriculture"           
## [19] "Government"                       "Health Care & Hospitals"         
## [21] "Information Technology"           "Insurance"                       
## [23] "Internet & Tech"                  "Manufacturing"                   
## [25] "Marketing & Advertising"          "Media & Publishing"              
## [27] "Nonprofit"                        "Private Security"                
## [29] "Real Estate"                      "Recruiting & Staffing"           
## [31] "Restaurants & Bars"               "Retail"                          
## [33] "Supermarkets"                     "Telecommunications"              
## [35] "Transportation & Logistics"       "Travel & Tourism"                
## [37] "Wholesale"

4.3.3 Create a Table “Salary for each company size”

This subtable consists of median salary for each company size in a given specific metro area and on a specific date. To create it, we extract some columns from our main data. We’re only interested in Dimension_Type of "Company Size" and Measure of "Median Base Pay".

Next, we’re going to assign Company_Size column data type to factor because company sizes are just an iteration for each area and date.

## [1] "<51"       "1001-5000" "201-500"   "5000+"     "501-1000"  "51-200"

4.3.4 Create a Table “Job openings for each sector”

This subtable consists of number of job openings and its YoY for each sector in a given specific metro area and on a specific date. To create it, we extract some columns from our main data. We’re only interested in Dimension_Type of "Industry" and Measure of "Job Openings".

And now, we assign Sector column data type to factor because sectors are just an iteration for each area and date.

##  [1] "Accounting & Legal"               "Aerospace & Defense"             
##  [3] "Architecture & Civil Engineering" "Arts & Entertainment"            
##  [5] "Automotive"                       "Banking & Financial Services"    
##  [7] "Beauty & Fitness"                 "Biotech & Pharmaceuticals"       
##  [9] "Business Services"                "Computer Software & Hardware"    
## [11] "Construction"                     "Consulting"                      
## [13] "Consumer Electronics"             "Consumer Services"               
## [15] "Education & Schools"              "Energy & Utilities"              
## [17] "Facilities"                       "Farming & Agriculture"           
## [19] "Government"                       "Health Care & Hospitals"         
## [21] "Information Technology"           "Insurance"                       
## [23] "Internet & Tech"                  "Manufacturing"                   
## [25] "Marketing & Advertising"          "Media & Publishing"              
## [27] "Nonprofit"                        "Other"                           
## [29] "Private Security"                 "Real Estate"                     
## [31] "Recruiting & Staffing"            "Restaurants & Bars"              
## [33] "Retail"                           "Supermarkets"                    
## [35] "Telecommunications"               "Transportation & Logistics"      
## [37] "Travel & Tourism"                 "Wholesale"

4.3.5 Create a Table “Job openings for each company size”

This subtable consists of number of job openings and its YoY for each company size in a given specific metro area and on a specific date. To create it, we extract some columns from our main data. We’re only interested in Dimension_Type of "Company Size" and Measure of "Job Openings".

We remember that as mentioned at the beginning of Data Cleansing Section, in addition to its normal values, Company Size also has abnormal values, namely “Other” and NA. We tackled the NAs already in the previous section, but not yet for “Other”. Therefore, firstly, let’s check if “Other” still exists in our data.

## [1] "1001-5000" "201-500"   "5000+"     "501-1000"  "51-200"    "<51"      
## [7] "Other"

Okay. It’s still there. Let’s remove it and convert the column to factor data type.

4.3.6 Create a Table “Quick facts”

This subtable contains Median Pay and Job Openings data from Glassdoor and US BLS (Bureau of Labor Statistics) for each area.

NAs identified! This can happen because after we’ve made the table wider, not each member of Dimension has similar Date attribute. That’s why some have certain dates, whereas others have other dates. For this case, I’m going to split up the table based on the columns after Date column and remove the NAs subsequently. So, we will have 5 smaller tables.

  • Quick Facts for US Median Pay
##          Area          Date US_Median_Pay 
##             0             0             0
  • Quick Facts for Metro Median Pay
##             Area             Date Metro_Median_Pay 
##                0                0                0
  • Quick Facts for Job Openings
##        Area        Date JobOpenings 
##           0           0           0
  • Quick Facts for US Job Openings
##           Area           Date US_JobOpenings 
##              0              0              0
  • Quick Facts for Metro Job Openings
##              Area              Date Metro_JobOpenings 
##                 0                 0                 0

4.3.7 Create a Table “Time Series”

This subtable comes from Timeseries column in the main dataframe. It’s going to have data of median salary and job openings in certain time interval. It will be more convenient if we split it up into two smaller dataframe.

  • Time Series for Job Openings
##        Metro         Date Job_Openings 
##            0            0            0

Cool! Nothing wrong happens. Let’s move on to Time Series for Median Salary.

  • Time Series for Median Salary

Though we don’t see any NA value, it’s not a bad idea if we check the dataframe.

##  Metro   Date Salary 
##      0      0      0

Great! No NA at all!

And now, we have all transformed the data and are ready for the next section to analyze the data.

5 Analysis

Finally, we arrive in this section. Here, we’re going to solve all problems defined in Introduction. So, let’s get it started.

5.1 What is the job title paid the most in each area in every given year?

In order to answer this question, we need Salary_Jobs dataframe. Afterwards, we filter the data by its each year and each area. Next, for all rows filtered, calculate mean of salary for each job title in 12 months for each year, and then find the max salary from all averaged salary of job titles. The codes to perform these steps can be find below. Firstly, I create a function.

By using above function, we can find highest paid job for each area every year.

  • The highest paid job in 2018

Okay! We’ve got the data for 2018. Let’s plot it!

And let’s see the plot in an interactive way.

  • The highest paid job in 2019

Do the same for the 2019 data.

And let’s see the plot in an interactive way.

Nice plots. In order to make the reader easier to compare 2018 and 2019, it will be better if we put both plots side by side. Then, let’s combine both data first.

Alright! The data are ready, and let’s plot again.

And let’s see the plot in an interactive way.

Looking at figures above, Attorney and Pharmacist mostly lead all jobs’ salary.

5.2 Which top 5 sectors do pay salary the most in Boston every year?

In order to answer this question, we need Salary_Sectors dataframe. Afterwards, we filter the data by its each year and Boston area. Next, for all rows filtered, calculate mean of salary for each job title in 12 months for each year, and then find the top 5 sectors with the highest salary. The codes to perform these steps can be find below.

We’ve got the data, and now let’s plot it.

And let’s see the plot in an interactive way.

## Warning in geom2trace.default(dots[[1L]][[2L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomLabel() has yet to be implemented in plotly.
##   If you'd like to see this geom implemented,
##   Please open an issue with your example code at
##   https://github.com/ropensci/plotly/issues

## Warning in geom2trace.default(dots[[1L]][[2L]], dots[[2L]][[1L]], dots[[3L]][[1L]]): geom_GeomLabel() has yet to be implemented in plotly.
##   If you'd like to see this geom implemented,
##   Please open an issue with your example code at
##   https://github.com/ropensci/plotly/issues

From above figures, we can see that top 4 sectors in 2018 and 2019 are the same. Those sectors stay remained in big four for 2 years. The second highest sector, Biotech and Pharmaceuticals, strengthens the relationship between the first and the second questions that Pharmacists, as one of highest paid job, perhaps also work at Biotech and Pharmaceuticals sector which is also one of sectors with highest salary.

5.3 Is it true that a company with more than 5000 employees in San Francisco pays its employees’ salary higher than a startup (a company with employees equal or less than 200) in NYC? Show the data in every year!

In order to answer this question, we need Salary_CompanySize dataframe. Afterwards, we filter the data by San Francisco and New York City areas and company sizes of more than 5000 employees and equal or less than 200.

Nice! We’ve got our subset dataframes. Subsequently, in order to answer the third question, it will be too risky if we employ average directly to combine all monthly data in San Francisco and NYC. Therefore, to capture all information missed by the mean method, we’re going to use scatter plot for each area.

And let’s see the plot in an interactive way.

By looking at the figures above, it is proven that for 2018 and 2019, big companies in San Francisco always pay their employees higher than startups in New York City pay. Therefore, there is an advice for this case:

If you were to work in the US and choose either to work in San Francisco or New York City, it is recommended to apply for a job at a big company. It produces more money for you!

5.4 What is the relationship between salary and job openings in the US?

In order to answer this question, we need TimeSeries_Salary and TimeSeries_JobOpenings dataframes. First of all, let’s see the distributions of each dataframe.

  • Time Series Salary

And now let’s see the plot in an interactive way.

From two figures above, we can see that the salary gradually increases every year, and this occurs evely in each area. With its distribution, San Francisco leads as the area with the highest salary, whereas Atlanta remains the lowest.

  • Time Series Job Openings

And now let’s see the plot in an interactive way.

From two figures above, we can see that the job openings gradually tend to grow every year, and this occurs evely in each area. With its distribution, New York City leads as the area with the highest salary (and perhaps the longest distribution), whereas the lowest ones fall on several areas. e.g. Atlanta and Philadelphia.

Afterwards, we group each dataframe in their months and years, make an average of salary and job openings for each group, and then sort them by month and year. After we’ve got the new dataframes, we can plot them and calculate the correlation.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And now let’s see the plot in an interactive way.

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

And then calculate the correlation.

## [1] "The corelation value between Salary & Job Openings in the US: -0.363"

As seen in the figures above, almost there is a little relationship between salary and job openings. We could say that in majority, small number (less than 1 million) of job openings spreads somewhat evenly on the salary. High density of the figure can be seen between ranges x: 50K-65K and y:0-1M. Meanwhile, there are some outliers in which small number of salaries is open in a large number of job openings.

In addition to analyze the relationship by viewing on the figure, we can use correlation formula to examine both salary and job openings. We also can see below the figure above that the correlation value of such relationship is -0.363. This means that both have a small negative relationship which means when the salary goes up, the job openings somewhat goes down, and vice versa.

5.5 Compare salaries of each area from Metro Median Pay and US Median Pay from Quick Facts!

In order to solve this problem, we need two dataframes, called QuickFacts_MetroMedianPay and QuickFacts_USMedianPay. Next, we group each in their areas and years to find the mean of salary. After this step has been done, we can combine both dataframes to compare their salary values and plot them.

And now let’s see the plot in an interactive way.

From both figures above, we can infer that despite of being slight, there exists a rise of median salary from 2018 to 2019 on each median salary type. In addition, we can see clearly that the salary of Metro Median Pay always stays higher than that of US Median Pay.

5.6 Show distributions of company size’s job openings and salary!

In order to solve this problem, we need two dataframes, i.e. JobOpening_CompanySize and Salary_CompanySize. Let’s plot each of them.

  • Job Openings Based on Company Size

And now let’s see the plot in an interactive way.

From the two figures above, we can see that the majority of company size stays between 0 and 50,000 job openings, whereas the companies with more than 5000 employees tend to open much more job openings, even those in New York City. Probably, this indicates that big corporates have a tendency to possess more budgets to recruit new employees.

Next, let’s see the distribution for company size based on their salary.

And let’s see the plot in an interactive way.

From two figures above, we see something interesting. We can conclude that the more a company has employees, the higher the probability the salary it offers. This interpretation also supports our previous statement in the third question that a big corporate tends to pay higher for its employees.

6 Conclusions

We’ve finished all stages to deliver this article. And now, let’s conclude all of them:

  • We’ve explained many things about the data used in Dataset section.
  • We’ve prepared the data, i.e. scraping the Glassdoor website, downloading the files, importing the files to RStudio, and preparing a single dataframe, in Preparation.
  • We’ve performed Data Cleansing to clean the data.
  • We’ve transformed the data in Data Transformation so that it can be easily used in the next sections.
  • We’ve analyzed the data in Analysis to solve all problems.
    1. Mostly the highest paid jobs are dominated by Attorney and Pharmacist.
    2. Energy & Utilities, Biotech & Pharmaceuticals, Aerospace & Defense, Computer Software & Hardware, Consulting, and Architecture & Civil Engineering are top 5 sectors with the highest salary in Boston.
    3. It is true that big corporates in San Francisco pay more for their employees than startups in New York City pay; even the gap is steep enough.
    4. The relationship between salary and job openings is small negative with correlation value -0.363.
    5. Despite of being slight, there exists a rise of median salary from 2018 to 2019 on each median salary type. The salary of Metro Median Pay always stays higher than that of US Median Pay.
    6. For job opening case, probably, there is a tendency that big corporates have more budgets to recruit new employees. Meanwhile, for salary case, the more a company has employees, the higher the probability the salary it offers.